Verbal lie detection using Large Language Models

Human accuracy in detecting deception through intuitive judgments has repeatedly been shown not to exceed chance level. Several automated verbal lie-detection techniques employing Machine Learning and Transformer models have therefore been developed to reach higher levels of accuracy. This study is the first to explore the performance of a Large Language Model, FLAN-T5 (small and base sizes), in a lie-detection classification task on three English-language datasets encompassing personal opinions, autobiographical memories, and future intentions. After performing stylometric analysis to describe linguistic differences in the three datasets, we tested the small- and base-sized FLAN-T5 in three Scenarios using 10-fold cross-validation: one with train and test sets coming from the same single dataset, one with the train set coming from two datasets and the test set coming from the third remaining dataset, and one with train and test sets coming from all three datasets. We reached state-of-the-art results in Scenarios 1 and 3, outperforming previous benchmarks. The results also revealed that model performance depended on model size, with larger models exhibiting higher performance. Furthermore, stylometric analysis was used to carry out explainability analysis, finding that linguistic features associated with the Cognitive Load framework may influence the model’s predictions.

The promising results in applying NLP techniques for psychological research suggest the possibility of combining metrics from different psychological frameworks in a new theory-based stylometric analysis, offering the possibility to investigate verbal lie detection from multiple perspectives in one shot.

Related works in the AI field
Previous works from the AI field have applied machine learning and deep learning models in a binary classification task for data-driven verbal deception detection.
Kleinberg and Verschuere 49 developed a database of future intentions to investigate whether combining machine and human judgments may improve accuracy in predicting deception. While finding that human judgment impairs automated deception detection accuracy, the authors implemented two machine learning models (i.e., vanilla random forest) trained respectively on LIWC and Part-of-Speech features (e.g., frequency of nouns, adjectives, adverbs, verbs), reaching an accuracy of 69% (95% CI: 63-74%) and 64% (95% CI: 58-69%), respectively. On the same dataset, Ilias et al. 50 evaluated six deep-learning models, including combinations of BERT (and RoBERTa), MultiHead Attention, co-attentions, and Transformers models. The best accuracy reached was 70.61% (± 2.58%) using a BERT with co-attention model. The authors also provided explainability analysis to understand how the models reached their decisions using a combination of LIME (a tool used to explain deep learning predictions in more straightforward and understandable terms by showing which specific words of the text influenced the outcome) and LIWC.
Capuozzo et al. 51 developed a new cross-domain and cross-language dataset of opinions, asking English-speaking and Italian-speaking participants to provide truthful or deceptive opinions on five different topics. After encoding the texts with FastText word-embedding, they trained Transformers models in multiple scenarios using 10-fold cross-validation, with averaged accuracy ranging from 63% (± 8.7%) in the "within-topic" scenario to a high of 90.1% (± 0.16%) in the "author-based" scenario.
In contrast, Sap et al. 52 developed a new dataset of narratives generated from memories and imagination and used an LLM (GPT-3) to compute a new metric called "sequentiality". Sequentiality is a metric of narrative flow that compares the probability of a sentence with and without its preceding story context. While providing insights into the cognitive processes of storytelling with an innovative computational approach, the authors did not employ a fine-tuning procedure for an LLM to classify different narratives.
The findings in the AI domain indicate that as the model's complexity increases, accuracy in predicting deception from texts tends to increase as well. However, this increase in accuracy often comes at the expense of explainability for these predictions. LLMs are currently among the most cutting-edge models capable of handling vast amounts and complexities of linguistic data, and the lack of literature on fine-tuning LLMs for lie-detection tasks provides worthwhile reasons to investigate this area.

Aims and hypotheses of the study
The main objectives and hypotheses of this study are outlined as follows:
• Hypothesis 1a): Fine-tuning an LLM can effectively classify the veracity of short narratives from raw texts, 1b) outperforming classical machine learning and deep learning approaches in verbal lie detection.
• Hypothesis 2): Fine-tuning an LLM on deceptive narratives enables the model to also detect new types of deception.
• Hypothesis 3): Fine-tuning an LLM on narratives from multiple contexts enables the model to learn and generalize across those contexts.
• Hypothesis 4): Model performance depends on model size, with larger models showing higher accuracy.
• Hypothesis 5a): The linguistic style distinguishing truthful from deceptive statements varies across different contexts, 5b) and can be a significant feature for model prediction.
To test Hypothesis 1a, we fine-tuned an open-source LLM, FLAN-T5, using three datasets: personal opinions (the Deceptive Opinions dataset 51), autobiographical experiences (the Hippocorpus dataset 52), and future intentions (the Intention dataset 49). Given the extreme flexibility of LLMs, this approach is hypothesized to detect deception from raw texts above the chance level. To test the advantage of our approach compared to classical machine and deep learning models (Hypothesis 1b), we decided to compare the results with two benchmarks, further described in the Methods and Materials section.
With regard to Hypotheses 2 and 3, according to empirical evidence, classical machine learning models tend to experience a decline in performance when trained and tested on the aforementioned scenarios 53-55. In contrast, LLMs have acquired a comprehensive understanding of language patterns during the pre-training phase. We posit that a fine-tuned LLM is capable of generalizing its learning across various contexts. Related to Hypothesis 4, we believe this generalization ability is further enhanced in larger models, as their size is associated with a more sophisticated representation of language.
Finally, to test Hypothesis 5, we introduced a new theory-based stylometric approach, named DeCLaRatiVE stylometry, to extract linguistic features related to the psychological frameworks of Distancing 29, Cognitive Load 31, Reality Monitoring 32, and Verifiability Approach 40,41, providing a pragmatic set of criteria to extract features from utterances. We applied DeCLaRatiVE stylometry to compare truthful and deceptive statements in the three aforementioned datasets in order to explore potential differences in terms of linguistic style. Our hypothesis suggests that the linguistic style distinguishing truthful from deceptive statements may vary across the three datasets, as these types of statements originate from distinct contexts. We also applied the DeCLaRatiVE stylometry technique to provide explainability analysis of the top-performing model.

Datasets
Three datasets were employed for this study: the Deceptive Opinions dataset 51, from now on the Opinion dataset; the Hippocorpus dataset 52, from now on the Memory dataset; and the Intention dataset 49. For each dataset, participants were required to provide genuine or fabricated statements in three different domains: personal opinions on five different topics (Opinion dataset), autobiographical experiences (Memory dataset), and future intentions (Intention dataset). Notably, the specific topic within each domain was counterbalanced among liars and truth-tellers. A more detailed description of each dataset is available in Supplementary Information as well as in the method section of each original article. Table 1 displays an example of truthful and deceptive statements about opinions, memories, and intentions.

Table 1. Truthful and deceptive example statements about opinions, memories, and intentions. In brackets, the topic assigned to the participant in the deceptive condition to fabricate the narrative.

Opinion (Abortion)
Truthful: While I am morally torn on the issue, I believe that ultimately it is a woman's body and she should be able to do with it as she pleases. I belive people should not dehumanize the fetus tough, to make themselves feel better. The decision about laws regarding this issue should be left up to the states to decide. To combat this problem, birth control should be easily accessible
Deceptive: Abortion is the termination of a life and should not be allowed. If a fetus has made it to the point of being able to survive "on its own" outside its mother's body, what right do we have to cut its life short. If the mother's life is in danger, she already chose that she was willing to sacrifice her life to have a child when she consented to procreating

Memory (My boyfriend and I went to a concert together and had a great time. We met some of my friends there and really enjoyed ourselves watching the sunset.)
Truthful: The day started perfectly, with a great drive up to Denver for the show. Me and my boyfriend didn't hit any traffic on the way to Red Rocks, and the weather was beautiful. We met up with my friends at the show, near the top of the theater, and laid down a blanket. The opener came on, and we danced our butts off to the banjoes and mandolins that were playing on-stage. We were so happy to be there. That's when the sunset started. It was so beautiful. The sky was a pastel pink and was beautiful to watch. That's when Phil Lesh came on, and I just about died. It was the happiest moment of my life, seeing him after almost a decade of not seeing him. I was so happy to be there, with my friends and my love. There was nothing that could top that night. We drove home to a sky full of stars and stopped at an overlook to look up at them. I love this place I live. And I love live music. I was so happy
Deceptive: Concerts are my most favorite thing, and my boyfriend knew it. That's why, for our anniversary, he got me tickets to see my favorite artist. Not only that, but the tickets were for an outdoor show, which I love much more than being in a crowded stadium. Since he knew I was such a big fan of music, he got tickets for himself, and even a couple of my friends. He is so incredibly nice and considerate to me and what I like to do. I will always remember this event and I will always cherish him. On the day of the concert, I got ready, and he picked me up and we went out to a restaurant beforehand. He is so incredibly romantic. He knew exactly where to take me without asking. We ate, laughed, and had a wonderful dinner date before the big event. We arrived at the concert and the music was so incredibly beautiful. I loved every minute of it. My friends, boyfriend, and I all sat down next to each other. As the music was slowly dying down, I found us all getting lost just staring at the stars. It was such an incredibly unforgettable and beautiful night

Intention (Going swimming with my daughter)
Truthful: We go to a Waterbabies class every week, where my 16-month-old is learning to swim. We do lots of activities in the water, such as learning to blow bubbles, using floats to aid swimming, splashing and learning how to save themselves should they ever fall in. I find this activity important as I enjoy spending time with my daughter and swimming is an important life skill
Deceptive: I will be taking my 8-year-old daughter swimming this Saturday. We'll be going early in the morning, as it's generally a lot quieter at that time, and my daughter is always up early watching cartoons anyway (5 am!). I'm trying to teach her how to swim in the deep end before she starts her new school in September as they have swimming lessons there twice a week

Table 2 reports descriptive statistics for each dataset, both overall and when grouped by truthful and deceptive sets of statements. These statistics include the minimum, maximum, average, and standard deviation of word counts. Word counts were computed after text tokenization using spaCy, a Python library for text processing. Additionally, Table 2 provides Jaccard similarity index values between truthful and deceptive vocabulary sets. Jaccard's index was derived by calculating the intersection (common words) and union (total words) of these two sets 50,56. The resulting index ranges from 0 to 1, with 0 indicating a completely different vocabulary between the two sets, and 1 indicating a completely identical vocabulary between the two sets. We reported the Jaccard similarity index to provide a measure of similarity or overlap between the word choices of truthful and deceptive statements within the respective datasets. Supplementary Information offers a detailed methodology for calculating the Jaccard similarity index.
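As an illustration of the Jaccard similarity index described above, the following minimal sketch computes the ratio between the size of the intersection and the size of the union of two vocabularies. It assumes lowercasing and whitespace tokenization for simplicity, whereas the study tokenized texts with spaCy; the function name and the toy statements are hypothetical.

```python
def jaccard_index(texts_a, texts_b):
    """Jaccard similarity between the vocabularies of two text collections."""
    # Assumption: lowercase + whitespace tokenization (the study used spaCy).
    vocab_a = {tok for text in texts_a for tok in text.lower().split()}
    vocab_b = {tok for text in texts_b for tok in text.lower().split()}
    union = vocab_a | vocab_b
    if not union:
        return 0.0  # two empty vocabularies: define the index as 0
    return len(vocab_a & vocab_b) / len(union)


# Toy statements, for illustration only.
truthful = ["we go swimming every week", "the weather was beautiful"]
deceptive = ["we go swimming this saturday"]
score = jaccard_index(truthful, deceptive)  # 3 shared words out of 11 distinct
```

A value close to 0 would indicate that truthful and deceptive statements share few words, while a value close to 1 would indicate nearly identical vocabularies.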

FLAN-T5
We adopted FLAN-T5, an LLM developed by Google researchers and freely available through Hugging Face's Python library Transformers (https://huggingface.co/docs/transformers/model_doc/flan-t5). Hugging Face is a company that provides free access to state-of-the-art LLMs through a Python API. Among the available LLMs, we chose FLAN-T5 because of its valuable trade-off between computational load and goodness of the learned representation. FLAN-T5 is an improved version of T5, a general text-to-text model capable of solving many NLP tasks (e.g., sentiment analysis, question answering, and machine translation), which has been improved through instruction fine-tuning on a large collection of tasks 57. The peculiarity of this model is that every task it was trained on is cast as a text-to-text task. For example, in sentiment analysis, the output prediction is the string used in the training set to label the positive or negative sentiment of each phrase rather than a binary integer output (e.g., 0 = positive; 1 = negative). Hence, its strength lies both in the generalized representation of natural language learned during the pre-training phase and in the possibility of easily adapting the model to a downstream task with little fine-tuning and without adjusting its architecture.
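To make the text-to-text formulation concrete, the sketch below shows one way a binary lie-detection label could be expressed as a target string and how a generated string could be mapped back to a label. The prompt prefix, the label strings, and the function names are assumptions made for illustration; they are not taken from the study.

```python
# Hypothetical label strings and prompt prefix (not taken from the study).
LABEL_TO_TEXT = {0: "truthful", 1: "deceptive"}
TEXT_TO_LABEL = {text: label for label, text in LABEL_TO_TEXT.items()}
PROMPT_PREFIX = "classify statement: "


def to_text_pair(statement, label):
    """Turn a (statement, integer label) pair into (source, target) strings."""
    return PROMPT_PREFIX + statement.strip(), LABEL_TO_TEXT[label]


def parse_prediction(generated):
    """Map a generated string back to an integer label (None if unrecognized)."""
    return TEXT_TO_LABEL.get(generated.strip().lower())


source, target = to_text_pair("I will take my daughter swimming on Saturday.", 1)
```

Under this framing, the model is trained to generate the target string given the source string, so no classification head needs to be added to the architecture.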

DeCLaRatiVE stylometric analysis
This study employed stylometric analysis to achieve two primary objectives. First, we aimed to describe the linguistic features that distinguished the three datasets before initializing the fine-tuning process. Second, we conducted explainability analysis to gain insights into the role of linguistic style that differentiated truthful and deceptive statements in the model's classification process. For this purpose, a new framework that we referred to as DeCLaRatiVE stylometry was adopted, which involved the extraction of 26 linguistic features in conjunction with the psychological frameworks of Distancing 29, Cognitive Load 30,31, Reality Monitoring 32,34, and VErifiability Approach 40,41. A full list of the 26 linguistic features with a short description is shown in Table 3. This comprehensive approach enabled the analysis of verbal cues of deception from a multidimensional perspective.
Features associated with the CL framework consisted of statistics about the length, readability, and complexity of the text 14,58-60 and were extracted using the Python library textstat. Features related to the Distancing and RM framework were computed using LIWC 42,43, the gold standard software for analyzing word usage. We scored each tokenized text on all the categories present in LIWC-22 using the English dictionary. The selection of the LIWC categories related to the Distancing and RM framework was guided by previous research on computerized verbal lie-detection 29,49,50,52,56 and a recent meta-analysis 14. RM was also investigated through linguistic concreteness of words 39. To determine the average level of concreteness for each statement, we utilized the concreteness annotation dataset developed by Brysbaert et al. 61. For the calculation of concreteness scores, a preprocessing pipeline was applied to textual data using the Python library spaCy: text was converted to lowercase and tokenized; then stop words were removed, and the remaining content words were lemmatized. These content words were then cross-referenced with the annotated concreteness dataset to assign the respective concreteness value when a match was found.
The concreteness score for each statement was then computed as the average of the concreteness scores for all the content words in that statement. Verifiable details were estimated by the frequency of unique named entities. Named entities were extracted with the NER technique using Python's library spaCy through the Transformer algorithm for the English language (en_core_web_trf, https://spacy.io/models/en#en_core_web_trf). Further details on how the 26 linguistic features were computed are provided in the Supplementary Information.
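The concreteness pipeline described above (lowercasing, tokenization, stop-word removal, lemmatization, lexicon lookup, averaging) can be sketched as follows. This is a simplified stand-in rather than the study's implementation: the stop-word list, the lemmatizer, and the lexicon values are toy placeholders (the study used spaCy and the Brysbaert et al. ratings), and averaging only over words found in the lexicon is an assumption.

```python
import re

# Toy placeholders: NOT the spaCy stop-word list nor the Brysbaert et al. ratings.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "my", "we", "i", "was", "is"}
CONCRETENESS = {"blanket": 5.0, "concert": 4.5, "friend": 4.0, "happiness": 2.0}


def lemmatize(token):
    """Crude stand-in for a real lemmatizer: strip a plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def concreteness_score(text):
    """Average concreteness of the content words that appear in the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    content_words = [lemmatize(t) for t in tokens if t not in STOP_WORDS]
    scores = [CONCRETENESS[w] for w in content_words if w in CONCRETENESS]
    # Assumption: words without a lexicon match are ignored in the average.
    return sum(scores) / len(scores) if scores else None


example_score = concreteness_score(
    "We met my friends at the concert and laid down a blanket."
)
```

In this toy example, only "friend", "concert", and "blanket" are matched, so the statement score is the mean of their three ratings.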

Experimental set-up
In this section, we describe the methodology that we applied in this work. As a first step, we performed a descriptive linguistic analysis of our datasets to address Hypothesis 5a), i.e., whether the linguistic style distinguishing truthful from deceptive statements varies across different contexts. To this end, we employed the DeCLaRatiVE stylometric analysis. As a second step, we tested the capacity of the FLAN-T5 model to be fine-tuned on a lie-detection task. To do so, we designed three scenarios to verify the following hypotheses:
• Hypothesis 1a): Fine-tuning an LLM can effectively classify the veracity of short narratives from raw texts, 1b) outperforming classical machine learning and deep learning approaches in verbal lie detection;
• Hypothesis 2): Fine-tuning an LLM on deceptive narratives enables the model to also detect new types of deception;
• Hypothesis 3): Fine-tuning an LLM on narratives from multiple contexts enables the model to learn and generalize across those contexts;
• Hypothesis 4): Model performance depends on model size, with larger models showing higher accuracy.
We expected hypotheses 1a, 1b, 3, and 4 to be verified, while we did not have any a priori expectation for the second hypothesis. The scenarios are described below:
1. Scenario 1: The model was fine-tuned and tested on a single dataset. This procedure was repeated for each dataset with a different copy of the same model each time (i.e., the same parameters before the fine-tuning process) (Fig. 1). This Scenario assesses the model's capacity to learn how to detect lies related to the same context and responds to Hypothesis 1a;
2. Scenario 2: The model was fine-tuned on two out of the three datasets and tested on the remaining unseen dataset. As for the previous Scenario, this procedure was iterated three times, employing separate instances of the same model, each time with a distinct combination of dataset pairings (Fig. 2). This Scenario assesses how the model performs on samples from a new context to which it has never been exposed during the training phase and provides a response for Hypothesis 2;
3. Scenario 3: We first aggregated the three train and test sets from Scenario 1. Then, we fine-tuned the model on the aggregated train sets and tested the model on the aggregated test sets (Fig. 1). This Scenario assesses the capacity of the model to learn and generalize from samples of truthful and deceptive narratives from multiple contexts and provides a response for Hypothesis 3.
In Scenarios 1 and 3, each experiment underwent 10-fold cross-validation. N-fold cross-validation is a statistical method used to estimate the performance of a model by dividing the dataset into n partitions (n = 10 for this study). For each partition i, we used the i-th partition as the test set and the remaining n−1 partitions as the training set (i.e., 90% of the data belongs to the training set, and the remaining 10% belongs to the test set). For each iteration, performance metrics are computed on the test set, stored, and then averaged. This procedure reduces the dependence of the performance estimate on any single train-test split and allows a fair comparison between different models. For our study, we employed identical train-test splits within Scenarios 1 and 3 and for both model sizes to guarantee a fair performance comparison. The average test accuracy across folds and its corresponding standard deviation are presented as performance metrics. Conversely, in Scenario 2, each pairing combination underwent fine-tuning using the entire two paired datasets as a training set, while the model's performance was assessed using the complete unseen dataset as a test set.
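The cross-validation procedure described above can be sketched as follows. The sketch assumes a fixed random seed so that the same splits can be reused across model sizes; `evaluate` is a hypothetical callback standing in for fine-tuning a model on the training indices and returning its test accuracy, and the use of the sample standard deviation is likewise an assumption.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; a fixed seed reproduces the splits."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal partitions
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Return mean and (sample) standard deviation of per-fold accuracy."""
    accuracies = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return mean(accuracies), stdev(accuracies)


def dummy_evaluate(train_idx, test_idx):
    """Placeholder for 'fine-tune on train_idx, return accuracy on test_idx'."""
    return len(test_idx) / (len(train_idx) + len(test_idx))


mean_acc, sd_acc = cross_validate(100, dummy_evaluate)
```

Calling `k_fold_indices` with the same seed for the small and the base model yields identical train-test splits, which mirrors the fair-comparison requirement stated above.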
Notably, in the Opinion dataset each participant provided truthful and deceptive statements for a total of five opinions. We therefore treated each opinion as a separate sample. To prevent the model from exhibiting inflated performance on the test set as a result of learning the participants' linguistic style, we adopted a precautionary measure: we ensured an exclusive division of participants between the training and test sets.
Together, Scenarios 2 and 3 provide evidence about the generalization capabilities of the fine-tuned FLAN-T5 model in a lie-detection task when tested on unseen data and on a multi-domain dataset. Furthermore, we tested whether model performance may depend on model size. Therefore, we first fine-tuned the small-sized version of FLAN-T5 in every scenario, and then we repeated the same experiments in every scenario with the base-sized version, providing a response for Hypothesis 4.
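The participant-exclusive split adopted for the Opinion dataset can be illustrated with the sketch below, in which folds are built over participants rather than over individual opinions. The function name, the participant identifiers, and the toy data are hypothetical; the exact procedure used in the study may differ.

```python
import random


def participant_exclusive_folds(participant_ids, k=10, seed=0):
    """Yield (train_idx, test_idx) so that no participant is in both sets."""
    unique_ids = sorted(set(participant_ids))
    random.Random(seed).shuffle(unique_ids)
    held_out_groups = [set(unique_ids[i::k]) for i in range(k)]
    for held_out in held_out_groups:
        test_idx = [i for i, p in enumerate(participant_ids) if p in held_out]
        train_idx = [i for i, p in enumerate(participant_ids) if p not in held_out]
        yield train_idx, test_idx


# Toy data: 20 hypothetical participants, 5 opinions each (one ID per sample).
ids = [f"p{n}" for n in range(20) for _ in range(5)]
folds = list(participant_exclusive_folds(ids, k=10))
```

Because every opinion written by a held-out participant lands in the test set, the model cannot gain accuracy simply by recognizing an author's writing style seen during training.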
To test Hypothesis 1b, i.e., to test the advantage of our approach when compared to classical machine learning models, we compared the results with two benchmarks:
1. A basic approach consisting of a bag-of-words (BoW) encoder plus a logistic regression classifier 62 (following the experimental procedure of Scenario 1);
2. A literature baseline based on previous studies providing accuracy metrics on the same datasets using a machine learning or a deep learning approach 49-51. For the Opinion dataset, characterized by opinions on five different topics per subject, we compared our results to the performance obtained in 51 with respect to their "within-topic" experiments because our approach is equivalent to theirs, with the only difference that we addressed all the topics in one model.
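The first benchmark, a bag-of-words encoder followed by a logistic regression classifier, can be sketched in plain Python as shown below. This is a didactic stand-in, not the implementation used in the study (which would more typically rely on an established machine learning library): the whitespace tokenizer, the learning rate, the number of epochs, and the toy texts are all assumptions.

```python
import math
from collections import Counter


def build_vocab(texts):
    """Map each distinct lowercase token to a column index."""
    vocab = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(vocab)}


def bow_vector(text, vocab):
    """Encode a text as a vector of token counts over the vocabulary."""
    counts = Counter(tok for tok in text.lower().split() if tok in vocab)
    vector = [0.0] * len(vocab)
    for tok, count in counts.items():
        vector[vocab[tok]] = float(count)
    return vector


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.1, epochs=200):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(text, vocab, weights, bias):
    x = bow_vector(text, vocab)
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)


# Toy training data (0 = truthful, 1 = deceptive), for illustration only.
train_texts = ["saw red rocks", "met phil there",
               "so incredibly nice", "so incredibly beautiful"]
train_labels = [0, 0, 1, 1]
vocab = build_vocab(train_texts)
weights, bias = train_logreg([bow_vector(t, vocab) for t in train_texts], train_labels)
train_predictions = [predict(t, vocab, weights, bias) for t in train_texts]
```

Unlike a fine-tuned LLM, this baseline ignores word order and context, which is what makes it a useful lower reference point.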
As a final step, we conducted an explainability analysis to investigate the differences in linguistic style between the truthful and deceptive statements that were correctly classified and misclassified by the model.This procedure aimed to provide a response to Hypothesis 5b, i.e., whether the model takes into account the linguistic style of statements for its final predictions.To achieve this result, we employed the DeCLaRatiVE stylometric analysis.
In Fig. 3, we provided a flow chart of the whole experimental set-up.

Fine-tuning strategy
Fine-tuning of LLMs consists of adapting a pre-trained language model to a specific task by further training the model on task-specific data, thereby enhancing its ability to generate contextually relevant and coherent text in line with the desired task objectives 57 .We fine-tuned FLAN-T5 in its small and base size using the three datasets and following the experimental set-up described above.We approached the lie-detection task as a binary classification problem, given that the three datasets comprised raw texts associated with a binary label, specifically instances classified as truthful or deceptive.
To the best of our knowledge, no fine-tuning strategy is available in the literature for this novel downstream NLP task. Therefore, our strategy followed an adaptation of Hugging Face's guidelines on fine-tuning an LLM for translation. Specifically, we chose the same optimization strategy used to pre-train the original model and the same loss function.
Notably, the classification task between deceptive and truthful statements was never performed during the FLAN-T5 pre-training phase, nor is it included in any of the tasks the model has been pre-trained on. Therefore, we performed the same experiments, described in the Experimental set-up section, multiple times with different learning rate values (i.e., 1e−3, 1e−4, 1e−5), and we finally chose the configuration shown in Table 4, which yielded the best performance in terms of accuracy. All experiments and runs of the three scenarios were conducted on Google Colaboratory Pro+ using their NVIDIA A100 Tensor Core GPU.
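For reference, the hyperparameters reported in Table 4 can be collected in a small configuration object, as in the sketch below. The class and field names are hypothetical, and the model identifiers are the public Hugging Face Hub names assumed here for the small and base checkpoints; only the numeric values come from Table 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the hyperparameters reported in Table 4."""
    model_name: str            # assumed Hugging Face Hub identifier
    learning_rate: float       # initial learning rate
    weight_decay: float = 0.01
    batch_size: int = 2
    num_epochs: int = 3


CONFIGS = {
    "small": FineTuneConfig("google/flan-t5-small", learning_rate=5e-4),
    "base": FineTuneConfig("google/flan-t5-base", learning_rate=5e-5),
}
```

Keeping the shared values as defaults makes explicit that the two model sizes differ only in their initial learning rate.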

Statistical procedure for descriptive linguistic analysis
After applying the DeCLaRatiVE stylometry technique, we obtained a stylistic vector of 26 linguistic features for each text of the three datasets.
In order to assess the significance of the observed differences between the groups, a permutation t-test was employed 63. This non-parametric method involves pooling all observations and then randomly redistributing them into two groups, preserving the original group sizes. The test statistic of interest (i.e., the difference in means) is then computed for these permuted groups. By repeating this process thousands of times (i.e., n = 10,000), we generated a test statistic distribution under the null hypothesis of no difference between the groups. The observed test statistic from the actual data was then compared to this distribution to compute a p-value, indicating the likelihood of observing such a difference if the null hypothesis were true. The advantage of using a permutation t-test is that no assumption about the distribution of data is needed. This analysis was conducted in Python using the SciPy and Pingouin libraries. For the Memory and Intention datasets, we computed a permutation t-test (n = 10,000) for independent samples for the 26 linguistic features to outline significant differences between the truthful and deceptive texts.
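The permutation procedure for independent samples described above can be sketched as follows. This is a minimal illustration rather than the SciPy or Pingouin routine used in the study; the two-sided comparison and the "+1" correction in the p-value are assumptions, and the toy scores are invented.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment, group sizes preserved
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Assumption: add-one correction so the p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy feature scores for truthful vs. deceptive statements (illustration only).
truthful_scores = [10, 11, 12, 13, 14, 15, 16, 17]
deceptive_scores = [1, 2, 3, 4, 5, 6, 7, 8]
observed_diff, p_value = permutation_test(truthful_scores, deceptive_scores)
```

In the study's setting, such a test would be run once per linguistic feature, which is why a correction for multiple comparisons is subsequently needed.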
For the Opinion dataset, our analysis proceeded as follows. Firstly, we computed the DeCLaRatiVE stylometry technique for all the subjects' opinions. This resulted in a 2500 (opinions) × 26 (linguistic features) matrix. Then, since each subject provided five opinions (half truthful and half deceptive), we averaged the stylistic vector separately for the truthful and deceptive sets of opinions. This procedure allowed us to obtain two different averaged stylistic vectors for the same subject, one for the truthful opinions and one for the deceptive opinions. Importantly, this averaging process enabled us to obtain results that are independent of the topic (e.g., abortion or cannabis legalization) and the stance taken by the subject (e.g., in favor or against that particular topic). Finally, we validated the statistical significance of these differences by conducting a paired sample permutation test (n = 10,000). Results for each dataset were corrected for multiple comparisons with Holm-Bonferroni correction.
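The Holm-Bonferroni correction mentioned above can be sketched as a step-down adjustment of the p-values obtained for the linguistic features. The function name and the example p-values are hypothetical; the procedure itself is the standard Holm method.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values, in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical uncorrected p-values for three features.
adjusted = holm_bonferroni([0.01, 0.04, 0.03])
```

A feature is then considered significant when its adjusted p-value stays below the chosen threshold (e.g., 0.05).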
The effect size was expressed by the Common Language Effect Size (CLES) with a 95% confidence interval (95% CI). CLES is meant to be more intuitive to interpret: it gives the probability that a specific linguistic feature, in a truthful statement picked at random, will have a higher score than in a deceptive statement picked at random 64. The null value for the CLES is the chance level at 0.5 (in a probability range from 0 to 1), which indicates that, when sampled, either group is equally likely to show the higher score. Cohen's d effect size with 95% CI was also computed to aid interpretation.
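The two effect sizes described above can be sketched as follows. CLES is computed here by comparing every truthful score with every deceptive score, and Cohen's d uses the pooled standard deviation; counting ties as one half is an assumption, and the confidence intervals reported in the study are not reproduced in this sketch.

```python
from math import sqrt
from statistics import mean, variance


def cles(group_a, group_b):
    """Probability that a random value from group_a exceeds one from group_b."""
    # Assumption: ties count as half a "win".
    wins = sum((a > b) + 0.5 * (a == b) for a in group_a for b in group_b)
    return wins / (len(group_a) * len(group_b))


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Toy feature scores (illustration only).
truthful_scores = [3, 4, 5]
deceptive_scores = [1, 2, 3]
cles_value = cles(truthful_scores, deceptive_scores)
d_value = cohens_d(truthful_scores, deceptive_scores)
```

A CLES of 0.5 corresponds to no difference between the groups, while values above or below 0.5 indicate the direction of the effect.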

Statistical procedure for explainability analysis
To examine whether the linguistic style of the input statements exerted an influence on the resulting output of the model and to provide explanations for the wrong classification outputs, we applied a DeCLaRatiVE stylometric analysis of statements correctly classified and misclassified by the top-performing model identified in Scenario 3 (FLAN-T5 base).
To this aim, during each iteration of cross-validation, we paired the sentences belonging to the test set and their actual labels with the labels predicted by the model. After the cross-validation ended, for each of the ten folds and for each of the 26 linguistic features of the sentences that composed the test set for that fold, we performed a non-parametric permutation t-test for independent samples (n = 10,000) for the following comparisons of interest: [...]
To compute the effect size, we averaged the CLES and Cohen's d effect size scores, with their respective 95% CI, obtained from each fold.

Table 4. FLAN-T5 hyperparameter configuration for the small- and base-sized versions. The initial learning rate for every scenario was 5e−4 for the small model and 5e−5 for the base model. This choice was motivated by preliminary experiment results, with the smaller model, but not the base model, generally performing better with higher learning rates. The weight decay coefficient was set to 0.01 in all models and Scenarios. The batch size was set to 2 for computational reasons, specifically to avoid running out of available memory, even though it is known that a larger batch size usually leads to better performance. Finally, the number of epochs was set to 3 after preliminary experiments showing the maximum test accuracy after the third epoch without overfitting.

Descriptive linguistic analysis
This section outlines the results of the descriptive linguistic analysis in terms of DeCLaRatiVE stylometric analysis to compare the three datasets on linguistic features. For the three datasets, Figs. 4, 5, and 6 show the differences in the number, the type, the magnitude of the CLES effect size, and the direction of the effect for the linguistic features that survived post-hoc corrections.
To give an example of these differences, the concreteness score of words ('concr_score') presented the largest CLES within the Intention dataset towards the truthful statements (Fig. 6), while in the Opinion dataset, it showed the largest CLES towards the deceptive statements (Fig. 4). Overall, the Intention dataset displayed fewer significant differences in linguistic features between truthful and deceptive statements than the Opinion and Memory datasets. In Table S5 (Supplementary Information), we reported, for all the linguistic features and the three datasets, all the statistics, the corrected p-values, the effect-size scores expressed by CLES and Cohen's d with 95% CI, and the direction of the effect.

Performance on the lie-detection classification task
This section presents the performance, in terms of average accuracy (and standard deviation) across the 10 folds, on the test sets after the last epoch of the small and base models in all the Scenarios.

Scenario 1
Table 6 reports the test accuracies for the FLAN-T5 model, categorized by dataset and model size, in Scenario 1. In each case, the base model, on average, outperformed the small model, with the Memory dataset showing the largest improvement (4%) and the Intention dataset showing only a 0.06% increase in average accuracy. These results indicate that the larger model size generally leads to improved performance across the three datasets.

Scenario 2
This scenario aimed to investigate our fine-tuned LLM's generalization capability across different deception domains. As presented in Table 5, the test accuracy for the three experiments in this scenario dropped to the chance level, showing that in no case was the model able to learn a general rule to detect lies coming from different contexts.

Scenario 3
In Scenario 3, we tested the accuracy of the FLAN-T5 small and base versions on the aggregated Opinion, Memory, and Intention datasets. The small-sized FLAN-T5 achieved an average test accuracy of 75.45% (st. dev. ± 1.6), while the base-sized FLAN-T5 exhibited a higher average test accuracy of 79.31% (st. dev. ± 1.3). In other words, the base-sized model outperformed the small model by approximately four percentage points.
Results in Table 6 show the disaggregated performance on individual datasets for the small and base FLAN-T5 models in Scenario 3, compared with their counterparts in Scenario 1. These comparisons show that FLAN-T5 small in Scenario 3 performed worse than in Scenario 1. In contrast, the FLAN-T5 base in Scenario 3 did not lose performance on any individual dataset relative to Scenario 1. We identified the FLAN-T5 base in Scenario 3 as the top-performing model because of its higher overall accuracy. The averaged confusion matrix of the 10 folds for this model is depicted in Fig. 7.
Notably, in every case, we outperformed both the bag-of-words + logistic regression baseline and the performance achieved on the same datasets in previous studies 49-51.
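The averaged confusion matrix mentioned above can be computed by counting the four cells in each fold and then averaging cell by cell. The sketch below assumes that "truthful" is the positive class, as in the labels used in this paper, and that the sample standard deviation is reported; the labels and predictions are toy values.

```python
from statistics import mean, stdev

CELLS = ("TP", "FN", "FP", "TN")


def confusion_counts(y_true, y_pred, positive="truthful"):
    """Count the four confusion-matrix cells for one fold."""
    counts = dict.fromkeys(CELLS, 0)
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            counts["TP" if pred == positive else "FN"] += 1
        else:
            counts["FP" if pred == positive else "TN"] += 1
    return counts


def average_confusion(per_fold_counts):
    """Return {cell: (mean, sample st. dev.)} across folds."""
    return {
        cell: (
            mean([c[cell] for c in per_fold_counts]),
            stdev([c[cell] for c in per_fold_counts]),
        )
        for cell in CELLS
    }


# Two toy folds, invented for illustration (T = truthful, D = deceptive).
T, D = "truthful", "deceptive"
fold_1 = confusion_counts([T, T, D, D], [T, D, D, D])
fold_2 = confusion_counts([T, T, D, D], [T, T, T, D])
print(average_confusion([fold_1, fold_2]))
```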

Explainability analysis
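This section aims to gain a deeper understanding of the top-performing model identified in Scenario 3 (FLAN-T5 base) through a DeCLaRatiVE stylometric analysis of statements correctly classified and misclassified by the model. The purpose of this analysis was to examine whether the linguistic style of the input statements influenced the model's output and to provide explanations for the wrong classification outputs. For this analysis, we compared:
a. Truthful statements misclassified as deceptive (False Negatives) vs. deceptive statements misclassified as truthful (False Positives);
b. Statements correctly classified as deceptive (True Negatives) vs. truthful statements misclassified as deceptive (False Negatives);
c. Statements correctly classified as truthful (True Positives) vs. deceptive statements misclassified as truthful (False Positives);
d. Truthful vs. deceptive statements correctly classified by the model (True Positives vs. True Negatives).
The statistically significant features reported survived post-hoc correction for multiple comparisons in each fold. Overall, for comparisons a), b), and c), we observed no statistically significant differences (p < 0.05) in any linguistic feature for most of the splits, with the only exceptions of the readability score in fold 1 and the 'Reality Monitoring' score in fold 6 for comparison a); the 'Reality Monitoring' score in folds 6 and 7 and the 'Contextual Embedding' score in fold 7 for comparison b); and the number of syllables and the number of words in fold 9 for comparison c). Conversely, for comparison d), several significant features emerged in all the folds and survived corrections for multiple comparisons. Figure 8 depicts the CLES effect size scores of linguistic features, sorted according to the number of times they were found to be significant among the ten folds. The top six features in Fig. 8 represent a cluster of linguistic features related to the Cognitive Load framework.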
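The four comparisons (a to d) presuppose that the test statements of each fold are first partitioned by classification outcome. A minimal sketch of this step follows, assuming "truthful" as the positive class in line with the labels used above; the data structure and the example statements are hypothetical.

```python
POSITIVE = "truthful"

# Pairs of outcome groups contrasted in comparisons a to d.
COMPARISONS = {
    "a": ("FN", "FP"),
    "b": ("TN", "FN"),
    "c": ("TP", "FP"),
    "d": ("TP", "TN"),
}


def outcome(truth, pred):
    """Label one prediction as TP, FN, TN, or FP (truthful = positive)."""
    if truth == POSITIVE:
        return "TP" if pred == POSITIVE else "FN"
    return "FP" if pred == POSITIVE else "TN"


def group_by_outcome(statements):
    """statements: iterable of (text, true_label, predicted_label) triples."""
    groups = {"TP": [], "TN": [], "FP": [], "FN": []}
    for text, truth, pred in statements:
        groups[outcome(truth, pred)].append(text)
    return groups


# Toy statements, invented for illustration only.
fold = [
    ("s1", "truthful", "truthful"),
    ("s2", "truthful", "deceptive"),
    ("s3", "deceptive", "deceptive"),
    ("s4", "deceptive", "truthful"),
]
groups = group_by_outcome(fold)
for name, (left, right) in COMPARISONS.items():
    print(name, groups[left], "vs.", groups[right])
```

The linguistic features of the two groups in each pair would then be compared fold by fold with the same statistical tests used in the descriptive analysis.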

Discussion
In the present research, we investigated the efficacy of a Large Language Model, specifically FLAN-T5 in its small and base versions, in learning and generalizing the intrinsic linguistic representation of deception across different contexts. To accomplish this, we employed three datasets encompassing genuine or fabricated statements regarding personal opinions, autobiographical experiences, and future intentions.

Descriptive linguistic analysis
Descriptive linguistic analysis was performed to compare the three datasets on linguistic features by exploring the differences in the DeCLaRatiVE style, i.e., analyzing 26 linguistic features extracted from the psychological frameworks of Distancing, Cognitive Load, Reality Monitoring, and VErifiability approach. This analysis aimed to test Hypothesis 5a, which postulates a variation in the linguistic style that differentiates truthful from deceptive statements across varying contexts (i.e., personal opinions vs. autobiographical memories vs. future intentions). The results from this analysis confirmed our hypothesis, showing that the linguistic features exhibiting statistically significant differences between truthful and deceptive statements indeed varied across datasets. This variation was observed in terms of the total number and type of features, the magnitude of the effect size (from very small to medium), and the direction of the effect. In the following paragraphs, the interpretation of the significant linguistic features of each dataset will be discussed.

Opinions
Our analysis of truthful and deceptive opinions using DeCLaRatiVE stylometry found several significant linguistic features related to the theoretical frameworks of CL, RM, and Distancing. In line with the CL framework, we observed that truthful opinions were characterized by greater complexity, verbosity, and more authenticity in linguistic style 14,31.
For features related to the RM framework, truthful opinions were characterized by fewer concrete words and more cognitive words, as also previously shown 55; in contrast, deceptive opinions showed higher scores in the concreteness of words, contextual details, and reality monitoring. On the one hand, these differences may reflect the reasoning processes that truth-tellers engage in when evaluating the pros and cons of abstract and controversial concepts (e.g., abortion); on the other hand, for deceivers, they may indicate difficulty in abstraction, resulting in faked opinions that sound more grounded in reality.
Finally, in line with previous literature on the distancing framework 29,65 and deceptive opinions 20,55, deceivers utilized more other-related word classes ('Other-reference') and fewer self-related words ('Self-reference'), confirming that individuals may tend to avoid personal involvement when expressing deceptive statements.

Memories
Following the analysis of truthful and deceptive narratives of autobiographical memories through DeCLaRatiVE stylometry, various linguistic features associated with the theoretical frameworks of CL, RM, VA, and Distancing were found to be significant.
As for opinions, according to the CL framework, truthful narratives of autobiographical memories exhibited higher levels of complexity and verbosity and appeared to be more analytical in style 14,31 .
In accordance with the RM framework 32-37, which posits that truthful memory accounts tend to reflect the perceptual processes involved in experiencing the event, whereas fabricated accounts are constructed through cognitive operations, we found that genuine memories exhibited higher scores in memory-related words and in the number of words associated with spatial and temporal information ('Contextual Embedding'), as well as an overall higher RM score. Conversely, deceptive memories showed higher scores in words related to cognitive processes (e.g., reasoning, insight, causation). Furthermore, in line with Kleinberg's truthful concreteness hypothesis 39, truthful memories were overall characterized by words with higher concreteness scores.
In line with the VA, truthful memories contained more verifiable details, as indicated by the greater number of named entities referring to times and locations 23,48. Notably, we found this effect even though participants lied in a low-stake scenario. However, deceptive memories were unexpectedly characterized by a higher number of self-references and named entities of 'People'. This result contrasts with previous literature on the distancing framework 14,29. One possible explanation of this significant but small effect is that liars may try to increase their credibility by fostering a sense of social connection.

Intentions
Upon examining truthful and deceptive statements of future intentions through DeCLaRatiVE stylometry, several linguistic features were found to be significant. Our findings are consistent with previous research claiming that genuine intentions contain more 'how-utterances', i.e., indicators of careful planning and concrete descriptions of activities. In contrast, false intentions are characterized by 'why-utterances', i.e., explanations and reasons for why someone planned an activity or for doing something in a certain way 48. Indeed, we found that true intentions were more likely to provide concrete and distinct information about the intended action, grounding their statements in real-world experiences and providing temporal and spatial references. Additionally, true intentions were characterized by a more analytical style and a greater presence of numerical entities. In contrast, false intentions exhibited a higher number of cognitive words and expressions and were temporally oriented toward the present and past.
Furthermore, we found evidence in line with the claim that liars may over-prepare their statements 48, as indicated by higher verbosity. Finally, in contrast with the distancing framework 14,29, we found a significantly higher proportion of self-references and mentions of people in deceptive statements. However, the effect size for this finding was small. As for deceptive memories, one possible interpretation is that liars may attempt to appear more credible by creating a sense of social connection.

Lie detection task
In order to test the capacity of the FLAN-T5 model to be fine-tuned on a Lie Detection task, we developed three scenarios.
In Scenario 1, we tested whether fine-tuning LLMs can effectively classify the veracity of short statements based on raw texts with performance well above the chance level (Hypothesis 1a). To this end, we fine-tuned FLAN-T5 in its small version to perform lie detection as a classification task. We repeated this procedure for the three datasets (i.e., opinions vs. memories vs. intentions). This fine-tuning process yielded promising results confirming our hypothesis, with an average accuracy of 80.64% (st. dev. ± 2.03%) for the Opinion dataset, 76.87% (st. dev. ± 2.06%) for the Memory dataset, and 71.46% (st. dev. ± 3.65%) for the Intention dataset.
In Scenario 2, we tested whether fine-tuning an LLM on deceptive narratives enables the model to detect new types of deception (Hypothesis 2). To verify this hypothesis, we fine-tuned FLAN-T5 (small version) on two datasets and tested it on the third one (e.g., train: opinion + memory; test: intention). Our findings show that the model performed at chance level in all three combinations of this Scenario, suggesting that there are no universal rules that the model can learn to distinguish truthful from deceptive statements and that would enable generalization of the task across different contexts. Indeed, as shown in the Descriptive Linguistic Analysis section, the three datasets differed significantly in terms of the content and the linguistic style by which truthful and deceptive narratives are delivered. Therefore, the model struggled to identify a general pattern of linguistic deception and appeared to engage in domain-specific learning, tailoring its classification capabilities to the specific domains of deception seen during training.
In Scenario 3, we tested whether fine-tuning an LLM on a multiple-context dataset enables the model to obtain successful predictions on a multi-context test set (Hypothesis 3). To this end, we fine-tuned and tested FLAN-T5 (small version) on the three aggregated datasets (i.e., opinion + memory + intention). The small-sized FLAN-T5 achieved an average accuracy of 75.45% (st. dev. ± 1.6). Additionally, the disaggregated performance on individual datasets showed only a small decrease in accuracy (around 1%) compared with the Scenario 1 counterparts. These findings confirmed our hypothesis, providing evidence of LLMs' ability to generalize when fine-tuned and tested on a multi-context dataset, in contrast to previous empirical evidence showing a decline in performance in machine learning models in the same scenarios 53-55.
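The disaggregated performance discussed above can be obtained by scoring the predictions on the aggregated test set once overall and once per source dataset. The following is a minimal sketch of that bookkeeping; the record format is an assumption and the records are toy values.

```python
from collections import defaultdict


def accuracy_by_dataset(records):
    """records: iterable of (dataset, true_label, predicted_label) triples.

    Returns accuracy per source dataset plus the "overall" accuracy.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for dataset, truth, pred in records:
        for key in (dataset, "overall"):
            totals[key] += 1
            hits[key] += int(truth == pred)
    return {key: hits[key] / totals[key] for key in totals}


# Toy predictions on an aggregated test set, invented for illustration.
records = [
    ("opinion", "truthful", "truthful"),
    ("opinion", "deceptive", "deceptive"),
    ("memory", "truthful", "deceptive"),
    ("memory", "deceptive", "deceptive"),
    ("intention", "truthful", "truthful"),
    ("intention", "deceptive", "truthful"),
]
print(accuracy_by_dataset(records))
```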
To test whether the model performance increases when employing larger models (Hypothesis 4), we repeated the same experiments in Scenarios 1, 2, and 3 with the base version of FLAN-T5.
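For reference, the two configurations differ only in the checkpoint and in the initial learning rate reported with Table 4. A minimal sketch of such a configuration is given below; the class and field names are hypothetical, and the checkpoint identifiers are assumed Hugging Face Hub names, since this section does not state where the checkpoints were obtained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported with Table 4 (field names are hypothetical)."""

    model_name: str
    learning_rate: float
    weight_decay: float = 0.01  # same in all models and Scenarios
    batch_size: int = 2         # kept small to avoid running out of memory
    num_epochs: int = 3         # chosen after preliminary experiments


CONFIGS = {
    # Checkpoint names are an assumption (Hugging Face Hub identifiers).
    "small": FineTuneConfig("google/flan-t5-small", learning_rate=5e-4),
    "base": FineTuneConfig("google/flan-t5-base", learning_rate=5e-5),
}

for size, config in CONFIGS.items():
    print(size, config)
```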
In Scenario 1, we found that the base version of FLAN-T5 provided higher accuracy than the small version. In Scenario 3, the base version of the model achieved an average accuracy of 79.31% (st. dev. ± 1.3), outperforming the small model by approximately four percentage points. Additionally, this increase in overall accuracy did not compromise the performance on any individual dataset compared with that achieved by the smaller model or by the FLAN-T5 base in Scenario 1. In contrast, the base version of FLAN-T5 in Scenario 2 still performed around the chance level.
On the one hand, the findings obtained from the base model in Scenarios 1 and 3 confirmed the hypothesis that the model size does influence the performance, likely because a bigger model is able to learn a better representation of linguistic patterns of genuine and deceptive narratives. Specifically, in Scenario 3, the FLAN-T5 base, with its larger size, possessed the capability to comprehend and integrate the features of the three distinct datasets altogether, thereby maintaining consistent performance across all individual datasets. In contrast, the smaller FLAN-T5 in Scenario 3 seemed to relinquish certain specialized abilities that are beneficial for specific datasets in order to classify deception across different contexts.
On the other hand, findings from Scenarios 2 and 3 (with small and base FLAN-T5) showed that LLMs, despite having acquired a comprehensive understanding of language patterns, still require exposure to prior examples to accurately classify deceptive texts within different domains.
Finally, to test whether our approach outperforms classical machine learning and deep learning approaches in verbal lie detection (Hypothesis 1b), we compared the results obtained from FLAN-T5 in its small and base versions with the performance of a simpler baseline of a logistic regressor based on BoW embedding 62 and of Transformer models previously employed in the literature on the Opinion 51 and Intention datasets 49,50 .
Specifically, on the Memory dataset, our approach yielded a 32% increase in performance over the logistic regression baseline. This improvement might be attributed to the longer and more complex nature of the stories in the Memory dataset, which challenges the effectiveness of more straightforward methods such as BoW-based logistic regression in a lie detection task. In contrast, LLMs already possess a robust language representation; thus, fine-tuning leverages this representation, tailoring their NLP proficiency specifically to a lie detection task and yielding higher accuracy.
The performance gained by fine-tuning LLMs was less pronounced for the Opinion and Intention datasets. For the Opinion dataset, this could be due to the relative ease of classification in this dataset, where simpler models can already achieve good performance, leaving a smaller margin for improvement. Nonetheless, the difference between our approach and the baselines is not negligible. In the Opinion dataset, we outperformed the literature baseline of a Transformer model trained from scratch by 17% in accuracy and surpassed our logistic regression baseline by six percentage points. For the Intention dataset, our approach showed a 5-percentage-point improvement over the logistic regression baseline and an improvement of around 1-2% over the best literature baseline. Notably, the best literature baseline for the Intention dataset (averaged accuracy: 70.61 ± 2.58%) used an approach similar to ours in terms of the type of model, involving a Transformer-based model (BERT + co-attention), which may explain the narrower performance gap. Besides the differences in performance, the main advantage of our approach is its simplicity and flexibility compared with those used in previous studies 49-51. Fine-tuning an LLM leverages an existing encoding of language that effortlessly handles any type of statement, unlike BoW-based logistic regression or training a new Transformer-based model from scratch. Taking all these aspects together, fine-tuning LLMs proved more advantageous in terms of feasibility, flexibility, and accuracy.
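To make the comparison concrete, the kind of baseline discussed above (bag-of-words counts fed to a logistic regression) can be sketched in a few lines. This is an illustrative, dependency-free implementation trained with plain stochastic gradient descent, not the implementation used in the study; the tokenization, learning rate, number of epochs, and the four toy statements are all assumptions.

```python
import math
from collections import Counter


def build_vocab(texts):
    """Map every lowercased whitespace token to a column index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bow_vector(text, vocab):
    """Count-based bag-of-words vector; out-of-vocabulary tokens are ignored."""
    counts = Counter(text.lower().split())
    return [counts[tok] for tok in vocab]


def predict_proba(x, weights, bias):
    """Probability of the positive class (1 = truthful)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit logistic regression with per-sample gradient steps on the log loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = predict_proba(x, weights, bias) - target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy statements and labels (1 = truthful, 0 = deceptive), invented for illustration.
texts = [
    "i walked to the park yesterday",
    "we met at the station around noon",
    "i think it was probably fine",
    "i guess it maybe happened",
]
labels = [1, 1, 0, 0]

vocab = build_vocab(texts)
X = [bow_vector(t, vocab) for t in texts]
weights, bias = train_logreg(X, labels)
predictions = [int(predict_proba(x, weights, bias) >= 0.5) for x in X]
print(predictions)
```

Such a model only sees word counts, which is one reason why it may struggle with longer and more complex narratives, whereas a fine-tuned LLM starts from a pretrained representation of language.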

Explainability analysis
To improve the explainability of the results obtained, we investigated whether the linguistic style that characterizes truthful and deceptive narratives could have a role in the model's final predictions (Hypothesis 5b). To this end, we applied a DeCLaRatiVE stylometric analysis to statements that were correctly classified and misclassified by the top-performing model identified in Scenario 3 (i.e., FLAN-T5 base).
In the misclassified sample, truthful and deceptive statements did not differ significantly for any linguistic feature extracted with the DeCLaRatiVE stylometry technique. The only exceptions were fold 1, which showed significant differences in the text's readability score, and fold 6, which showed significant differences in 'Reality Monitoring' scores. No significant differences in linguistic features were detected in any fold between deceptive statements that were correctly classified as deceptive (True Negatives) and truthful statements that were misclassified as deceptive (False Negatives), with the exception of 'Reality Monitoring' in folds 6 and 7 and the 'Contextual Embedding' score in fold 7. Finally, truthful statements that were correctly classified as truthful (True Positives) and deceptive statements that were misclassified as truthful (False Positives) exhibited no significant differences, except for the number of syllables and the number of words in fold 9. We argue that significant differences confined to selected linguistic features in specific folds indicate that these findings may not be generalizable and are likely influenced by the particular fold under analysis. When taken together, most of the analyzed folds showed a substantial overlap in linguistic style. Consequently, the model might have exhibited poor classification performance for those statements because, while deceptive, they showed a linguistic style resembling truthful statements and vice versa.
In contrast, correctly classified statements displayed several significant differences between truthful and deceptive statements. Notably, the top six linguistic features in Fig. 8 reached statistical significance in at least 6 out of 10 folds. The fact that we found a consistent pattern of linguistic features in correctly classified statements but not in misclassified statements provides evidence for our hypothesis, suggesting that the linguistic style of statements does have a role in the model's final predictions. In more detail, the top six linguistic features depicted in Fig. 8 represent a cluster of linguistic cues associated with the CL framework 31, specifically low-level features related to the length, complexity, and analytical style of the texts that may have enabled the distinction between truthful and deceptive statements. The fact that linguistic cues of CL survived among the several features available (in a mixed dataset of utterances reflecting opinions, memories, and intentions) raises the question of whether CL cues may be more generalizable than other cues that are, in contrast, more specific to a particular type of deception.
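The fold-level stability criterion used above (how many of the 10 folds a feature was significant in) amounts to a simple count. A minimal sketch follows; apart from 'concr_score', which appears earlier in this paper, the feature names and the per-fold sets are hypothetical.

```python
from collections import Counter


def rank_features(significant_per_fold, min_folds=6):
    """Count in how many folds each feature was significant.

    Returns (ranked, stable): features sorted by descending count
    (ties broken alphabetically), and those reaching `min_folds`.
    """
    counts = Counter(f for fold in significant_per_fold for f in set(fold))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    stable = [name for name, n in ranked if n >= min_folds]
    return ranked, stable


# Hypothetical sets of significant features for 10 folds (illustration only).
folds = (
    [{"n_words", "n_syllables", "concr_score"}] * 6
    + [{"n_words", "n_syllables"}] * 2
    + [{"n_words"}] * 2
)
ranked, stable = rank_features(folds)
print(ranked)
print(stable)
```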

Conclusion, limitations, and further work
At the time of writing and to the best of our knowledge, this is the first study involving the use of an LLM for a lie-detection task.
LLMs are Transformer-based models trained on large corpora of text that have proven able to generate coherent text in human natural language and to be extremely flexible across a wide range of NLP tasks 28. In addition, these models can be further fine-tuned on specific tasks using smaller task-specific datasets, achieving state-of-the-art results 28. In this study, we tested the ability of a fine-tuned LLM (FLAN-T5) on lie-detection tasks.
First, given the extreme flexibility of LLMs, we tested whether fine-tuning an LLM is a valid procedure to detect deception from raw texts above chance level and to outperform classical machine learning and deep learning approaches. We found that fine-tuning FLAN-T5 on a single dataset is a valid procedure to obtain state-of-the-art accuracy, as shown by the fact that this procedure outperformed the baseline model (BoW + logistic regression) and previous works that applied machine learning and deep learning techniques to the same datasets 49-51,62.
Second, we wanted to investigate whether fine-tuning an LLM on deceptive narratives enables the model to also detect new types of deceptive narratives. Findings from Scenario 2 disconfirm this hypothesis, suggesting that the model requires previous examples of different deceptive narratives to provide adequate accuracy in this classification task.
Third, we investigated whether it is possible to successfully fine-tune an LLM on a multiple-context dataset. Results from Scenario 3 confirm that a fine-tuned LLM may provide adequate accuracy in detecting deception from different contexts. We also found that fine-tuning on multiple datasets can increase performance relative to fine-tuning on a single dataset.
Furthermore, we hypothesized that the model performance may depend on the model size, given that the larger the model, the better it forms its inner representation of language. Results from Scenarios 1 and 3 confirmed that the base-sized model of FLAN-T5 provides higher accuracy than the small-sized version.
Finally, with our experiments, we introduced the DeCLaRatiVE stylometry technique, a new theory-based stylometric approach to investigate deception in texts from four psychological frameworks (Distancing, Cognitive Load, Reality Monitoring, and Verifiability approach). We employed the DeCLaRatiVE stylometry technique to compare the three datasets on linguistic features, and we found that fabricated statements from different contexts exhibit different linguistic cues of deception. We also employed the DeCLaRatiVE stylometry technique to conduct an explainability analysis and investigate whether the linguistic style by which truthful or deceptive narratives are delivered is a feature that the model takes into account for its final prediction. To this end, we compared statements correctly classified and misclassified by the top-performing model (FLAN-T5 base in Scenario 3), finding that correctly classified statements share linguistic features related to the cognitive load theory. In contrast, truthful and deceptive misclassified statements do not present significant differences in linguistic style.
Given the results achieved, we highlight the importance of a diversified dataset to achieve a generalized good performance. We also consider crucial the balance between the diversity of the dataset and the size of the LLM, suggesting that the more diverse the dataset is, the bigger the model required to achieve higher-level accuracy. The main advantage of our approach consists of its applicability to raw text without the need for extensive training or handcrafted features.
Despite the demonstrated success of our model, three significant limitations impact the ecological validity of our findings and their practical application in real-life scenarios.
The first notable limitation pertains to the narrow focus of our study, which concentrated solely on lie detection within three specific contexts: personal opinions, autobiographical memories, and future intentions. This restricted scope limits the possibility of accurately classifying deceptive texts within different domains. A second limitation is that we exclusively considered datasets developed in experimental set-ups designed to collect genuine and completely fabricated narratives. However, individuals frequently employ embedded lies in real-life scenarios, in which substantial portions of their narratives are true, rather than fabricating an entirely fictitious story. Finally, the datasets employed in this study were collected in experimental low-stake scenarios where participants had low incentives to lie and appear credible. Because of all the above issues, the application of our model in real-life contexts may be limited, and caution is advised when interpreting the results in such situations.
The limitations addressed in this study underscore the need for future research to expand the applicability and generalizability of lie-detection models for real-life settings. Future works may explore the inclusion of new datasets, trying different LLMs (e.g., the most recent GPT-4), different sizes (e.g., FLAN-T5 XXL version), and different fine-tuning strategies to investigate the variance in performance within a lie-detection task. Furthermore, our fine-tuning approach completely erased the previous capabilities possessed by the model; therefore, future works should also focus on new fine-tuning strategies that do not compromise the model's original capabilities.
13:22849 | https://doi.org/10.1038/s41598-023-50214-0 | www.nature.com/scientificreports/

... training and test sets, such that any individual who had their opinions assigned to the training set did not have their opinions assigned to the test set, and vice versa.

Figure 1. Visual illustration of Scenarios 1 and 3.

Figure 3. Visual illustration of the whole experimental set-up. The Opinion, Memory, and Intention datasets underwent Descriptive Linguistic Analysis using DeCLaRatiVE stylometry. A baseline model consisting of Bag of Words (BoW) and Logistic Regression (Scenario 1) was also established for the three datasets. Then, the FLAN-T5 model in small and base versions was fine-tuned across Scenarios 1, 2, and 3. Finally, an Explainability Analysis was conducted on the top-performing model using DeCLaRatiVE stylometry to interpret the results.
a. Truthful statements misclassified as deceptive (False Negatives) vs. deceptive statements misclassified as truthful (False Positives);
b. Statements correctly classified as deceptive (True Negatives) vs. truthful statements misclassified as deceptive (False Negatives);
c. Statements correctly classified as truthful (True Positives) vs. deceptive statements misclassified as truthful (False Positives);
d. Truthful vs. deceptive statements correctly classified by the model (True Positives vs. True Negatives).

Figure 4. Horizontal stacked bar chart presenting the Common Language Effect Size (CLES) estimates for the significant linguistic features that survived post-hoc corrections in the Opinion dataset. The CLES estimates represent the probability (ranging from 0 to 1) of finding a specific linguistic feature in truthful opinions (sky blue) rather than in deceptive ones (salmon). The CLES for truthful opinions are sorted in descending order, while the CLES for deceptive opinions are sorted in ascending order.

Figure 5. Horizontal stacked bar chart presenting the Common Language Effect Size (CLES) estimates for the significant linguistic features that survived post-hoc corrections in the Memory dataset. The CLES estimates represent the probability (ranging from 0 to 1) of finding a specific linguistic feature in truthful memories (sky blue) rather than in deceptive ones (salmon). The CLES for truthful memories are sorted in descending order, while the CLES for deceptive memories are sorted in ascending order.

Figure 6. Horizontal stacked bar chart presenting the Common Language Effect Size (CLES) estimates for the significant linguistic features that survived post-hoc corrections in the Intention dataset. The CLES estimates represent the probability (ranging from 0 to 1) of finding a specific linguistic feature in truthful intentions (sky blue) rather than in deceptive ones (salmon). The CLES for truthful intentions are sorted in descending order, while the CLES for deceptive intentions are sorted in ascending order.

Figure 7. Averaged confusion matrix of the top-performing model identified as FLAN-T5 base in Scenario 3. In each square, the results obtained represent the average (and standard deviation) from the test set of each iteration of the 10-fold cross-validation.

Figure 8. Linguistic features in truthful and deceptive statements that were accurately classified by FLAN-T5 base in Scenario 3. The bar plot shows the averaged Common Language Effect Size among the ten folds of linguistic features that survived post-hoc corrections. Linguistic features are sorted in descending order according to the number of times they were found to be significant among the 10 folds (displayed at the side of each bar). Linguistic features higher on average in truthful texts are shown in sky blue, while those higher on average in deceptive texts are shown in salmon.

Table 2. Summary statistics of the number of words for each dataset and for the truthful and deceptive sets of statements. The Jaccard Similarity Index and its qualitative interpretation in brackets refer to the similarity between the truthful and deceptive vocabulary sets for each dataset.

Table 3. List and short description of the 26 linguistic features pertaining to the DeCLaRatiVE stylometry technique.

Table 5. Test accuracy of the FLAN-T5 models in Scenario 2 (three combinations of train sets). The performance comparison is between the small and base versions of the FLAN-T5 model in the three combinations of train sets: opinion + memory, opinion + intention, memory + intention.

Table 6. Test accuracy of the FLAN-T5 models in Scenarios 1 and 3 for the three datasets. Reported values are means ± standard deviation of the 10 folds. Best results per evaluation metric are in bold. The literature baseline for the Opinion dataset refers to the average accuracy and standard deviation from all within-topic accuracies from FastText Embedding + Transformer 51. The literature baseline for the Intention dataset refers to the accuracy from Vanilla Random Forest using LIWC features (confidence interval in square brackets) 49, and to the averaged accuracy and standard deviation from the RoBERTa + Transformers + Co-Attention model and the BERT + co-attention model 50, respectively.